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Over the last 10-15 years a general understanding of the chemical reaction of protein folding 
has emerged from statistical mechanics. The lessons learned from protein folding kinetics based on 
energy landscape ideas have benefited protein structure prediction, in particular the development 
of coarse grained models. We survey results from blind structure prediction. We explore how 
second generation prediction energy functions can be developed by introducing information from an 
ensemble of previously simulated structures. This procedure relies on the assumption of a funnelled 
energy landscape keeping with the principle of minimal frustration. First generation simulated 
structures provide an improved input for associative memory energy functions in comparison to the 
experimental protein structures chosen on the basis of sequence alignment. 



Every other summer, research groups compare their 
different protein structure prediction methods via the 
Critical Assessment of Techniques for Protein Structure 
Prediction (CASP) experiment. During the CASP ex- 
periment, sequences of experimentally determined pro- 
tein structures that are not public available are placed 
on the web. This exercise is double blind where neither 
the organisers nor the participants know the experimen- 
tally determined structure. Groups respond with up to 5 
ranked predictions, before a predetermined date, such as 
the publication of the structures. Since the inception 
of CASP, three dimensional structure prediction cate- 
gory has expanded to address related prediction ques- 
tions such as sequence to structure alignment quality, 
amino acid sidechain placement, multi-domain domain 
boundaries, and the ordered or disordered nature of a 
protein sequence . 

These different prediction questions can be examined 
from a common framework: the principle of minimal frus- 
tration. The principle of minimal frustration states that 
native contacts must be more favourable, in a strict sta- 
tistical sense 0, than non-native contacts in order for 
proteins to fold on physiologic time scales With- 
out a sufficient energetic bias towards the native state, 
the multi-dimensional energy surface as a function of 
native structure possesses too many minima for an ef- 
ficient stochastic search. Such an energy surface would 
lead to slow folding kinetics, even if the proteins ever 
found a sufficiently stable native state. This is not true 



since we know most proteins fold without assistance 
The opposite of a rough energy surface is biased toward 
the native basin without any local minima is an abso- 
lute manifestation of the principle of minimal frustration. 
Funnelled energy surfaces have no unfavourable energetic 
traps {i.e. Go Models) have been shown to reproduce 
most features of experimental folding kinetics [a, 0, Q • 
These energy landscape concepts can richly be applied 
in several areas of chemistry and physics Apparently, 
evolution's energy function is minimally frustrated. 

The correlation between a protein sequence and its 
three dimensional structure can be described using sim- 
ilar landscape language. As a protein sequence diverges 
away from a consensus wild type sequence, the poten- 
tial for energetically unfavourable interactions increases. 
The wild type sequence and its homologues will fold to- 
ward the same native basin. Only once enough frustrat- 
ing contacts are added to the wild type sequence will the 
sequences no longer correspond to the native state en- 
semble. Sequences with over 25% sequence identity to 
previously determined protein structures are called com- 
parative modelling targets. The energy landscape under- 
lying such a prediction is a Go model based on the struc- 
ture of the known homologue. This heavily funnelled 
energy surface yields high resolution structures, with the 
discrepancies the turns and residues which have poor se- 
quence to structure alignments. Fig. ^ demonstrates the 
distribution of homology of proteins sequence to known 
structures included in CASP6. Since proteins below 25% 
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Distribution of the Difficulty of CASP Targets 
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FIG. 1: The difficultly of the prediction targets as defined by FIG. 2: The ab initio prediction targets amino acid lengths 
percent identity. Proteins below 25% sequence identity are for CASP6. 
usually considered ab initio or fold recognition targets. 



sequence identity are considered new fold recognition tar- 
gets, 70% of the structures were comparative modelling 
targets. Recently sequenced genomes such as E. coli have 
the same ratio of ab initio to comparative modelling tar- 
gets, which suggests the analysis of this ratio over time 
could be a useful measure of the progress of efforts to 
experimentally find examples of all of Nature's protein 
structures. 

In contrast to comparative modelling, ab initio struc- 
ture predictions do not have the advantage of creating 
Go like energy surfaces. While many ab initio targets 
contain less than 150 residues, and thus are candidates 
for standard techniques, there are several that are longer 
as shown in Fig. |2| Most longer sequences will be multi- 
domain proteins. This causes new problems. Folding 
a protein with two hydrophobic cores allows for new 
sources of frustration, beyond those present in single do- 
main proteins. To obtain predictions for such problem- 
atic sequences, they usually must be divided into their 
constituent domains. Current methods for dividing the 
sequence into domains range from purely sequence based 
algorithms, which look for sequence patterns in multiple 
sequence alignments, to simulation techniques that look 
for hydrophobic core formation amongst multiple inde- 
pendent simulations 

The case studies we highlight of difficult structure pre- 
dictions were chosen from our participation in the CASP5 
and CASP6 experiments. In CASP5, we utilised several 
improved techniques, such as a backbone hydrogen bond 
term for the proper formation of beta sheets, and a liq- 
uid crystal like term to ensure parallel or anti-parallel 
sheet formation 12]. We also performed target sequence 
averaging which enhances the funnelling of the predic- 
tion landscape 0|, and assessed our ensemble of sam- 
pled structures with a twenty letter contact for submis- 
sion Our most striking result from this round of 
blind prediction was a prediction for target T0170 pro- 
tein databank 15! code (PDB ID lUZC). Fig. |3| presents 
the sequence dependent overlay of our Model 1 structure 
with the experimentally determined structure. The se- 



quence dependent alignment quality of this structure is 
high as measured by a Q score of 0.38. Q is an order 
parameter defined in Eq. ^ that measures the sequence 
dependent structural complementarity of two structures, 
where Q is defined as a normalised summation of C-alpha 
pairwise contact differences. 
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The resulting order parameter, Q, ranges from 0, when 
there is no similarity between structures at a pair level, 
to 1 which is an exact match. Q has been shown to be 
more sensitive in determining the quality of intermedi- 
ate quality protein structure predictions Q scores 
of 0.4 for single domain proteins equals an RMSD of 5A. 
In most cases the reference state for the Q score is the 
native state, but often one wants to compare structural 
similarity between structures in a simulation. A sequence 
independent measure CE 16J, also scores well(CE Z-score 
— 4.1). The CE Z-score measures structural complemen- 
tarity without regard to sequence information, and is pa- 
rameterised such that structures between with a Z-Score 
greater than 4 belong to the same protein structure fam- 
ily. The contact map of the prediction, Fig. 0] which 
identifies all of the C-alpha intermolecular interactions 
within 9A where the axes are the index of the protein, 
shows the correct packing of the helices. 

Fig. [3 shows the size of partially correct continuous in 
sequence segments under an RMSD cutoff. When com- 
pared against the other predictions, our Model 1 predic- 
tion (Dark Blue) was amongst the best of all submitted 
structures. Also the relative success of the prediction, 
classifies this target as being of moderate difficulty. In 
this example CASP demonstrates small (70 residues) all 
alpha proteins are beginning to be successfully predicted 
by a variety of ab initio techniques. 
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FIG. 3: Sequence dependent superpositions of Model 1 struc- 
ture against the native state for CASP5 target T0170 (PDB 
ID lUZC). Blue represents the prediction and the native state 
is represented with red. 




FIG. 4: Contact map of target T0170 (PDB ID lUZC) model 
1 structure against the NMR structure. 
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FIG. 5: Percentage of residues under a RMSD limit. (Dark 
Blue - Model 1, Light Blue - Model 2-5, Orange - Other 
Groups Prediction) 



Methods 
Energy Functions and Sampling 

We used an Associative Memory Hamiltonian ( AMH) , 
with optimised parameters to sample and predict struc- 
tures [13, m, [13 • The AMH uses a reduced descrip- 
tion of the amino acid chain in order to gain the orders 
of magnitude computational acceleration over all atom 
models needed to fold moderate length proteins with or- 
dinary computational resources, and has been described 
in great detail before This is possible due to re- 

ducing the number of atoms per residue from over 10 
to only three backbone atoms: the Cq,,C/3, and O. The 
remaining backbone heavy atoms {N, C') can be recon- 
stituted using the ideal geometry of the peptide bond as a 
template. Also we reduced the complexity of the amino 
acid code from twenty letters, to four. We chose the 
four letter code, which has the advantage of preserving 
a diversity of contacts, because it is still simple enough 
that the number of coefhcients that need to be optimised 
does not create problems of inaccurate statistics due to 
limits of interactions encountered in the molten globule 
state. Specifically four amino acid classes are defined: 
hydrophilic (A, G, P, S, T), hydrophobic (C, I, L, M, F, 
W, Y, V), acidic (N,D,Q,E), and basic (R,H,K) 20]. The 
optimisation procedure produces an energy landscape 
that discriminates the native state from misfolded states, 
while avoiding kinetic traps reasonably well [2, [^ • The 
AMH is an analogue to the neural networks designed by 
Hopfield to synthesise information from multiple previ- 
ous experiences [2^. This energy function recalls struc- 
tural patterns in a set of known protein structures. The 
Hamiltonian produces an energetically favourable mini- 
mum when there is sufficient coherence between a set of 
three dimensional protein structures. 

The AMH energy function, in its most general sense, 
consists of a backbone term, i?back and interaction term, 
i?i„t defined by, 

-Etotal = -Bback + -^int • (2) 

The backbone energy term consists of several terms that 
reproduce the self-avoiding behaviour of the polypeptide 
chain give by, 

£^back = — (£^SHAKE + ^'rama + E^v + i^chain + £'chi)- (3) 

As in many molecular mechanics energy functions, cova- 
lent bonds are preserved by using the SHAKE algorithm 
23] £^SHAKE, which enables an increase of the time step 
size, and eliminates the need for a traditional harmonic 
calculation. The SHAKE algorithm preserves the dis- 
tances between neighbouring Ca-Cp, and Ca-O atoms. 
The neighbouring residues limit the variety of angles the 
backbone atoms can occupy, producing a Ramachadran 
plot [23|. This distribution of angles is reinforced by a 
potential, -Brama with low barriers to encourage rapid lo- 
cal backbone movements. Another term, E^v maintains 
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a sequence specific excluded volume constraint between 
Cq-Cq, C 13-0/3, O-O, Ca-Ci3 atoms. The chain connec- 
tivity, and planarity of the peptide bond due to resonance 
is ensured by means of a harmonic potential, i?chain- Also 
the chirality of the Cq, due to its four different bonding 
partners is maintained using scalar product of neighbour- 
ing unit vectors of carbon and nitrogen bonds, i?chi- 

While i?back creates peptide like stereo-chemistry, it 
does not introduce the majority of the attractive inter- 
actions that result in folding. Such interaction are sup- 
plied by the rest of the potential E\at. The interactions 
described by Ei^t depends on the sequence separation 
\i — j\. Specifically, they are divided into three proxim- 
ity classes x{\i — x — short {\i — j\ < b) , x = medium 
(5 < I? — j\ < 12) and x = long {\i — j\ > 12) as defined 
by Eq.H 
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long- 



(4) 



Also these distance classes are also referred to as local, 
super-secondary, and tertiary, respectively. 

The AMH interaction potential ii^int is based on corre- 
lations between a target's sequence signified by and 
the sequence-structure patterns in a set of memory pro- 
teins /i represented as and a pairwise contact po- 
tential. The pairs in the target and in the memory are 
first associated using a sequence-structure threading al- 
gorithm . The database is assumed to contain a sub- 
set of pair distances, which may match the associated 
pair distances in the target structure. The general form 
of the associative memory interaction is: 
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where the similarity between target pair distances 
with aligned memory pair distances r^^, is measured 
by Gaussian functions whose widths are given by (7y = 
\i — jf'^^ A. The set of parameters, 7, encode the simi- 
larity between residues i and j , and the memories residues 
i' and j'. Favourable interactions occur during coherence 
in the distances achieved in the sequence to structure 
alignments. The encoding of the alignment information 
in Eq. is only an example of what is used for the all- 
alpha energy functions. Other encodings have been used 
in the alpha-beta energy function JT5| to improve the dis- 
crimination between helices and strands. While the first 
term in Eq. |S1 is the superposition of interactions over a 
set of experimentally determined structures, it also shares 
a dependence on the sequence separation between the in- 
teracting residues. For residues separated by greater than 
12 residues, a contact potential i?iong, as described by the 



second term in Eq. |S1 which does not depend on interac- 
tion information from the structures used to define local 
in sequence interactions. In this term Ck{N) represents 
a sequence length dependence scaling to account for the 
variation in probability distributions based on sequence 
length. Five wells instead of the three defined here by 
Ukirij) determine interactions in the alpha-beta energy 
function . Energy units e are defined excluding back- 
bone contributions in terms of a native state energy in 
Eq.H 
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where N is the number of residues. A distance class 
scaling a, is constant in each of the energy classes because 
they are designed to be equal during the optimisation 

The solvent in these energy functions is treated in a 
mean field manner, where the implicitly solvated native 
states of the proteins define the energy gap to the molten 
globule state. Solvent effects are also present in the se- 
quence to structure alignment energy functions, but they 
are not explicitly represented in the molecular dynamics 
energy function. Water mediated contacts with an ex- 
panded 20 letter code in the contact potential were in- 
troduced '25l , based upon previous work which examined 
protein recognition |2S*i22i] • The water mediated contacts 
along with a new one dimensional burial term has shown 
promising results especially for long proteins. 

Once the energy function is optimised, the minima of 
the energy function are probed via simulated annealing 
with molecular dynamics simulations. This minimisation 
technique integrates Newton's equations of motions to 
determine the energy of the next time step. Simulated 
annealing slowly reduces the temperature from a high 
value as in the tempering of steel in metallurgy. This 
minimisation algorithm allows for local searches, while 
allowing modest energy barriers to be overcome. 

Energy landscape ideas have generated an optimisa- 
tion scheme for creating funnelled energy surfaces. While 
funnelled, the parameterisation does not eliminate all 
non-native minima. The superposition of several energy 
surfaces reduces the likelihood of such trapping in local 
minima j23, ■ The flexibility of the AMH framework 
provides several ways of incorporating multiple sequence 
alignment information. Some of the options include cre- 
ating a consensus sequence [l^ . simulating different ho- 
mologue sequences concurrently, and averaging the re- 
sulting forces and energies The averaged AMH en- 
ergy function we used average the forces and the energies 
of these simulation over a set of sequences, because it al- 
lows for more generalisable results than may occur with 
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other techniques, and is described as in Eqs.jTJIHl, 
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1 

(8) 

To superimpose multiple energy landscapes, we need 
a multiple sequence alignment to a set of sequence ho- 
mologue. Sequences homologous to the target sequence 
are first identified by using PSI-Blast with default pa- 
rameters .3Qj ■ Each sequence above and below a certain 
sequence identity thresholds (70% 30% in this work) is 
then aligned against each other, and proteins that have 
greater than 90% sequence identity to other identified 
sequence homologues are removed. The culling of the 
sequence homologues via open source bioinformatic li- 
braries is necessary for two reasons [sJl- Some classes 
of proteins have a large number of sequence homologues, 
and performing a multiple sequence alignment can be im- 
practical. Also removing sequence homologues attempts 
to remove biases introduced when there are few homo- 
logues. The remaining sequences were aligned using a 
multiple sequence alignment algorithm js^. Within the 
AMH energy function, gaps occurring in a sequence align- 
ment could be addressed in a variety of ways, in this work 
gaps in the target sequence are ignored, while gaps within 
homologues are completed with residues from the target 
protein. This strategy may introduce biases toward the 
target sequence, but this approach is preferred to per- 
haps ignoring interactions. Fig. Elshows a representative 
multiple sequence alignment for a target, coloured with 
respect to the four letter code of the AMH. If one focuses 
on the hydrophobic yellow residues, the alternating hy- 
drophobic hydrophilic patterns for beta strands forma- 
tion are apparent. 

Another way of introducing the characteristics of mul- 
tiple funnelled energy landscapes is using information de- 
rived from neural networks trained on multiple sequence 
alignments. Even with different architectures, neural net- 
works typically achieve 75% accuracy when predicting 
secondary structure. Recently it has been shown art- 
ful combinations of two different predictions can slightly 
improve the results [33i |. This secondary structure infor- 
mation was added by a biasing energy function to either 
a helix or a strand via, Eq^^ = 105e(Q-g,,)4 13], where 
Qss is defined by Eq. El 
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FIG. 6: Multiple sequence alignment for target T0212 (PDB 
ITZA) coloured with respect to a four letter code, where red 
represents acidic residues, blue represents polar residues, yel- 
low represents nonpolar residues, and green represents basic 
residues. 



Qss is takes the same form of the Q define before in Eq. 
except that potential acts over n independent secondary 
structures units derived from secondary structure pre- 
diction. The distances that define energy minimum, r^J 
are determined from experimentally determined Carte- 
sian distances. Previously in an effort to incorporate this 
secondary structure information, the Ramachandran po- 
tential has been altered to bias the backbone The 
local in sequence potential Eq^^ is preferred to the Ra- 
machandran potential biasing because it avoids SHAKE 
violations when the strength of the bias is increased. 

For most selected CASP6 targets, we followed the same 
protocol. We averaged the AMH potential over multi- 
ple sequence homologues when they were available. In 
most cases, information from secondary structure pre- 
diction was used to bias secondary structure units to 
their predicted structures. Molecular dynamics with sim- 
ulated annealing sampled low energy structures. Also 
constant temperature slightly above the predicted glass 
temperature were used to generate candidate structures. 
We collected structures above Tk, which usually gives 
the fastest folding thereby compromising between the 
funnelled and glassy behaviour of the energy function. 
Once the kinetics of the structure slows, the diversity 
of structures encountered disappears. The slow kinet- 
ics regime typically predominates around a temperature 
of 0.9. While using a linear annealing schedule up to 
Tk, about 25 different collapsed structures were collected 
during each simulation. The amount of sampling per- 
formed for each structure varied from about 500 to 20,000 
different structures. While this was roughly 50 times 
more sampling than we had previously performed in the 
CASP setting, it is dwarfed by the efforts of others who 
can sample in the millions of structures by using more 
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TABLE L Linear Regression of Hydrophobic Burial Energy TABLE IL Linear Regression of Mscore 



Proteins 


fold class 


Correlation CoefUcient 




Proteins 


fold class 


Correlation Coefficient 


1R69 


a 


.22 




1R69 


a 


.29 


1BG8 


a 


.33 




1BG8 


a 


.04 


lUTG 


a 


.63 




lUTG 


a 


.26 


IMBA 


a 


.40 




IMBA 


a 


.26 


2MHR 


a 


.46 




2MHR 


a 


.10 


IIGD 


a/P 


-.70 




IIGD 


a/13 


.37 


3IL8 


a/P 


-.06 




3IL8 


a/P 


.13 


ITIG 


a/p 


.02 




ITIG 


a/p 


.19 


IBFG 




.16 




IBFG 




.08 


ICKA 


P 


-.14 




ICKA 


P 


.03 


1JV5 





.11 




1JV5 





-.07 


IKOS 




.27 




IKOS 




-.10 



powerful computational resources [3a|- Subsequently, a 
smaller subset of structures was selected for submission 
by evaluating the size of the hydrophobic core and the hy- 
drophilic surface area. Further selection criteria included 
visual inspection, agreement with the preliminary sec- 
ondary structure prediction, and low energies predicted 
from a second optimised contact energy function. 



Selection of Structures 

To select candidate structures from independent simu- 
lated annealing or constant temperature trajectories, we 
calculated both the buried hydrophobic surface area and 
the exposed hydrophilic surface area along the trajectory. 
In an effort to calculate the buried or exposed surface 
area, we assigned residues which have greater than the 
mean total surface area as solvent exposed, and the con- 
verse as solvent buried. We scaled each surface area by 
a weight to represent the likehhood of amino acid burial. 
It was modelled to the free energy cost of transferring 
each amino acid from octanol to water |36j in an effort 
to introduce a sequence specificity as shown in Eq. 1101 



N 



Burial 



7i * SAi, if SAi > total surface 



0, 



if SAi > total surface 



(10) 



This normalisation is desirable because the surface ac- 
cessibility is calculated from our minimal Cq,C^, and 
O atoms, which produces amino acids of the same vol- 
ume. Such an energy term would be more valuable if 
non-additive interactions, and a larger number of hy- 
dration layers were added. The unavoidable inaccura- 
cies in atomistic force fields, and the slow glassy kinetics 
of sidechain rearrangements prevented any completion of 
the backbone and sidechains with all-atoms or minimisa- 
tion of putative structures ^7] . 

Another parameter we used after sampling to select 
and examine structures was based on sequence specific 



backbone probabilities. The specificity of local interac- 
tions have been fruitful for improving collapsed proteins 
structure predictions [s^. In a similar spirit sequence 
specific nearest neighbour probabilities were also used 
[331 ■ Local signals have also been theoretically shown to 
contribute roughly a third of the total folding gap for a 
helical proteins 40] . Similarly we started looking at such 
probabilities to further improve the backbone potential 
of the AMH, but without needing secondary structure 
prediction. 



Eti-i 



JV-l 



LogP{i~l,i,i + l,<j>,i^) (11) 



Somewhat surprisingly, the summation of the resulting 
log probabilities from 4,012 highly resolved protein struc- 
tures could be used as an additional measure as part of 
a strategy for the selection of structures out of an en- 
semble. Table U shows the linear correlation coefficients 
between structures of varying Q-scores, sampled above 
Tk which is where the best predictions usually occur be- 
fore glassy dynamics dominates the kinetics. For both 
proteins with all a, and a//3 compositions, the summed 
log probabilities provide discrimination, but not within 
the all /3 folds. These results shown in Table Hll echo the 
previous findings in terms of the 0, ^ probability maps 
and also that all beta structures are less well predicted 
when a dihedral angle energy function is minimised. The 
weakness of nearest neighbour excluded volume effects to 
determine local structure is also demonstrated in the con- 
sistent weakness of secondary structure prediction with 
respect to beta strands. Alpha helices are correctly pre- 
dicted to roughly 80% accuracy while beta strands av- 
erage 60% accuracy by such pure sequence based algo- 
rithms. The difficulty of predicting some circular dichro- 
ism spectroscopy results for beta to coil transitions can 
also be attributed to the weakness of the local backbone 
excluded volume interactions. 
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Results 
Blind Simulations 

For ab initio blind predictions in CASP6, we selected 
sequences if there were no experimentally determined ho- 
mologous structures found by automated comparative 
modelling servers. The overall results for the ab initio 
structure prediction simulation are summarised in Ta- 
ble where the abbreviations are length = the number 
of amino acids, temp = temperature where best structure 
was encountered, sub Q or samp Q = the best sampled 
and submitted structures respectively as a judged by a 
function of Q, and traj = number of independent trajec- 
tories simulated. The CASP6 targets are classified un- 
der the following categories (NF=new fold, FR/A=fold 
recognition analog, FR/H=fold recognition homologue, 
CM/H=comparative modelling hard). Targets T0207, 
and T0270 where removed from the experiment so their 
CASP class are undefined. Structures for T0207 and 
T0272-b were not submitted. There are a few main points 
from this data. Using a Q of 0.4 as a measure successful 
prediction, we were able to encounter high quality struc- 
tures for 4 targets and nearly so for 4 others. The tem- 
perature at which the best structures were sampled was 
between the 1.2 and 0.8, which is the annealing regime 
we investigated most throughly. This suggests our an- 
nealing schedules were close to the behaviour we sought 
a priori. The longer the length of the target sequence 
clearly reduced the quality of our predictions. Also the 
proteins where we had a greater number of trajectories 
naturally showed better structures. A final observation 
identifies the difference between the best submitted struc- 
ture and the best sampled structure as disappointingly 
large for some of the targets. This can be attributed our 
strategy of maximising the number of simulations per- 
formed rather than more carefully studying our trajecto- 
ries. This difference would be smaller if greater care was 
taken in the selection of the structures, but the number 
of high quality structures would have been less. 

Calculating the free energy of several randomly chosen 
CASP6 targets in Fig. [7| provides us with probabilities of 
what we would have expected to see if more simulations 
has been performed during the CASP season. We can 
estimate how many independent structures need to be 
seen at this temperature to sample the region 10 fc^T 
greater than the minimum of the free energy. We see 
roughly e^*^ « 2 * 10"* independent sampled structures 
would be needed at a temperature of 1.0. Target T0242 
(PDB ID 2BLK) illustrates why the best structure we 
encountered had a Q score of 0.3. For this target, we 
sampled roughly 7000 different structures. To achieve a 
Q of 0.45, according to the free energy analysis we would 
need to increase our sampling by a factor of 3. 

When extrapolating to lower temperatures, we see 
lower barriers to the folded state, and thus if sampling 
were more complete one would see better structures at 
these temperatures. This further cooling would be a 



TABLE III: CASP6 Results: Best Submitted and Sampled 
Structures 



target 


length 


fold 


sub Q 


samp Q 


temp 


traj 


CASP 


T0281 


70 


a/p 


.34 


.48 


0.85 


986 


NF 


T0201 


94 




.36 


.44 


1.39 


199 


NF 


T0212 


123 




.26 


.42 


1.30 


97 


FR/A 


T0230 


102 


a//3 


.31 


.42 


1.05 


395 


FR/A 


T0207 


76 


a/d 




.39 


0.98 


297 




T0224 


87 




.30 


.38 


1.20 


501 


FR/H 


T0263 


97 


aid 


.34 


.38 


0.94 


404 


FR/H 


T0272-a 


85 


a//3 


.30 


.37 


0.94 


30 


FR/A 


T0265 


102 


a//3 


.29 


.34 


0.83 


374 


CM/H 


T0213 


103 


q//3 


.26 


.32 


0.98 


448 


FR/H 


T0243 


88 


a//3 


.31 


.32 


0.95 


418 


FR/H 


T0239 


98 


a//3 


.25 


.32 


0.99 


424 


FR/A 


T0214 


110 


a//3 


.24 


.30 


0.41 


348 


FR/H 


T0242 


115 


Q//3 


.27 


.30 


0.89 


358 


NF 


T0270-b 


125 


a//3 


.27 


.28 


0.99 


32 




T0270-a 


122 




.25 


.27 


0.80 


47 




T0272-b 


124 






.26 


0.81 


34 


FR/A 


T0273 


186 


a//3 


.22 


.24 


0.98 


189 


NF 



CASP6 Target Free Energies at Temperature of 1 .0 




FIG. 7: Free Energy calculations for CASP6 targets T0213, 
T0214, T0224, T0242, and T0243. 



favorable strategy except that dynamic slowing due to 
the approach of the glass transition interferes, which 
occurs at a temperature of 0.9. Naturally, it is best 
to sample just above the glass transition temperature, 
which can be approximately found from Q-Q correlation 
(< Q{t)Q{t + t) >) liJ], and by using the Kolmogorov- 
Smirnov test to asses the independence of samples 42] . 
Table Irvl indicates what was the best structure we would 
be likely to see under such sampling conditions. The 
differences between thermodynamically accessible struc- 
tures and those that were sampled suggests that in- 
creased simulations would have improved the best struc- 
tures sampled considerably. The free energy of target 
T0243 (PDB ID not available) is significantly different 
due to its unusual architecture that contains a buried 
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TABLE IV: Likely Quality of Structure Seen at a Free Energy 
of 10 CASP6 



Target 


PDB 


length 


Probable Q 


Sampled Q 


T0213 


1TE7 


103 


.43 


.32 


T0214 


1S04 


110 


.40 


.30 


T0224 


IRHX 


87 


.39 


.38 


T0242 


2BLK 


123 


.45 


.30 


T0243 




88 


.28 


.32 



helix. 

As in Fig. 0] we compare contact maps between the 
predictions and the experimentally resolved structure. 
Often contact maps give more insightful than superim- 
posed structures especially when viewing in 2 dimen- 
sions. We compare the submitted structures with the 
best structure encountered during our sampling to deter- 
mine what aspect of folding are being captured by our en- 
ergy functions. For a short target T0201 (PDB ID 1S12), 
we see that sometimes a small difference in the contact 
maps in Fig. |H1 can greatly improve the quality of the 
prediction even though a large number of contacts are 
already correct. There was a larger fraction of incorrect 
contacts in our best submitted structure for target T0230 
(PDB ID IWCJ) than we would have seen in the best 
generated structure as shown in Fig. |2| The incorrect 
parallel docking of the first two helices is largely resolved 
in the best sampled structure and the Q score improves 
considerably. Similar analysis for target T0281 (PDB ID 
IWHZ) shows incorrect long range contacts between the 
two otherwise properly oriented helices, and disordered 
intermediate interactions as in Fig. ^| Again the best 
sampled structure has these problems largely resolved. 

One amusing way to analyze predicted structures is to 
view the results of different structure prediction schemes 
as intermediates along a kinetic folding coordinate. How 
far did the simulated annealing get in the folding path- 
way? By mapping the likelihood of folding against 
its location on a folding free energy surface, we can assess 
how close the model structure is to the folded state in a 
kinetic sense. The energy function for the kinetic mod- 
eling is a Go model i.e. ideally non- frustrated energy 
function. The difference between the Go model and the 
structure prediction energy functions is a measure of the 
quality of those structure prediction schemes. A pairwise 
additive Go model was created based on the native struc- 
ture of the experimentally determined protein. As it has 
been discussed previously , this Go model has both a 
polypeptide backbone energy terms that are the same as 
in the structure prediction energy function as described 
by Eq. |31 and an interaction potential were the Gaussian 
interaction potential distances rf^ are determined by the 
native state formally described in Eq. 1121 




X-Ray Structure 



FIG. 8: Contact maps for the best submitted (Q=.36) and 
the best sampled (Q=.44) structures for target T0201. 




X-Ray Structure 



a ^-^ ' 

i<3-3 



]exp 



FIG. 9: Contact maps for the best submitted (Q=.31) and 
(12) the best sampled (Q=.42) structures for target T0230. 



9 




FIG. 10: Contact maps for the best submitted (Q=.34) and 
the best sampled (Q=.48) structures for target T0281. 



The interactions are defined in this minimal model as 
residues with greater the three residues in sequence sep- 
aration between 0°" - C"' ,0°" - C^^ .C^^ - 0°" - 
atom pairs. The weights 7go or the depth of the Gaus- 
sian wells are set to (.177, .048, .430) in order to approx- 
imately divided the interaction energy equally between 
the different distance classes as defined in the original 
structure prediction energy function. The width of the 
gaussians crf^ are defined by the sequence separation as 
before. Notice that the Go Hamiltonian does not contain 
a summation over a set of memory structures as in the 
AMH, this is because all of the contacts in this definition 
of a Go model uses only the native state. One hundred in- 
dependent simulations of this Go energy function are per- 
formed starting with the best structure of three different 
structure prediction groups. Pfold is then calculated by 
simply determining whether the simulation started from 
the model structure folds to the native structure or not. 
The results in Fig. 1111 compare three minimalist models, 
one of which (the Baker Group) has undergone a further 
atomistic refinement. The minimalist models are only a 
few ksT from the barrier's peak, they only infrequently 
cross it. It also suggests that a detailed less coarse grain 
sampling procedure maybe necessary for correctly assign- 
ing hydrophobic packing and hydrogen bonding patterns. 



F Q, Q contact for T0281 Go Model Temp=.93 




FIG. 11: Go Model Free Energy Surface with final prediction 
structures shown. The Pfold values for the three proteins are 
the Wolynes Group 0.07, Scheraga Group 0.02, and the Baker 
Group 0.97 with an error of +/- 0.1. 



The Next Generation in Structure Prediction 

Examining the contact maps of structures encountered 
during the GASP experiment, we observed that contacts 
between residues with a large separation in sequence can 
be inaccurate, even when most of the contacts within a 12 
residues sequence separation are native like. A different 
way of expressing this idea is the amount of funnelling is 
different within the different distance classes. When com- 
paring the quality of the intermediate range interactions 
in the sampled structures with the memories obtained 
with sequence analysis from the protein data bank, a dra- 
matic increase of native- like interactions is seen as shown 
in Fig. El While this was not used in the recent GASP 
exercise, we thought it would be interesting and straight 
forward to improve the prediction energy function by us- 
ing these first generation results as better memory struc- 
tures in the AMH. Sequence to structure alignments yield 
gap-less identity alignments thereby eliminating any pos- 
sibility of secondary structure registry shift irregularities. 

Different energy functions have been used to identify 
native like proteins from an ensemble of simulated struc- 
tures. Alternatively, one can rely on energy landscape 
ideas, and assume a mean field contact potential derived 
from the energy minima of the simulated energy function. 
This approach has the additional advantage, that it does 
not rely on using a distinct energy function: one is simply 
seeing how close simulated annealing was to completely 
accessing the global minimum of the prediction energy 
function. To select structures a pairwise Q denoted by a 
lower case q, is calculated between all of the ground state 
structures encountered in 200 independent simulations. 



10 



Total Q 

bhick = 1.07 dotted = 0.07 dashed = 




Local Distance Class 

black = 1.07 dotted = 0.07 dashed 





J. ' 




1 \ 1 \ '1 


v' ' ' ;' 


V ; 




I.I 


1 



Super Secondary Distance Class 
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Tertiary Distance Class 
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FIG. 12: These figures show the total Q, and the Q in the 
diflerent distance classes between PDB structures, structures 
from a temperature of 1, and a temperature near zero for 
structures used as inputs to AMH simulations. The lowest 
temperature show the largest improvement because they are 
fully collapsed. 



By dividing the inter-chain interactions under the same 
definitions as used in the energy function, the potential 
for improvements from such second generation structures 
over the original memories is considerable for protein 
256B. As seen in Fig. the low temperature struc- 
ture as identified by little q have an increased amount 
of native like contacts in all distance classes. This style 
of analysis also suggests potential changes in the energy 
function. The long distance in sequence interactions are 
also improved over that original memory used in the en- 
ergy function. In order to utilise this improvement the 
energy function in the distant interaction class was mod- 
ified. The original function used a multi-well contact 
potential, which does not use any information from the 
memory proteins. For this third distance class the next 
generation energy function uses associative memory con- 
tacts much as was done before for modelling with ho- 
mologues 0]. The energy function now takes the form 



E 



n N 



-4r)- (13) 



The parameters for this new distance class are taken from 
the second distance class. The total energy is defined over 
the set of memory structures as defined by Eq. 1141 



^ I pmodell 



1 



36 ^ AN 



(14) 



instead of using the values taken from the optimisation. 
Some next generation memory structures are more col- 
lapsed than the memory structures used in initial round 
of simulation. Furthermore the scaling is changed from 
the initial round of simulation's 1:1:1 scaling amongst the 
three different (local, super-secondary, tertiary) distance 
classes to 1.5:0.5:1 in an effort to approximate the equal 
division of energy in each distance class. To examine the 
equilibrium properties of this energy function, we need 
to estimate the glas s transition temperature. As pre- 
viously explored '431, we use the Kolmogorov-Smirnov 
test to determine if two independent simulations have 
been sampled from the same equilibrium distribution. 
This test ensures that simulations are equilibrated. Once 
the glass transition temperature (Tr-) is estimated using 
the Kolmogorov-Smirnov test, we can use standard tech- 
niques to quantify the equilibrium properties of differ- 
ent energy functions. The proteins we used for study 
of the next generation AMH strategy are cytochrome 
B562 (PDB ID 256b), HDEA (PDB ID 1BG8), because 
they are both of moderate size and one of them (1BG8) 
was not in the training set of proteins that optimized 
the original energy function. An additional advantage 
of this choice is these proteins have different fold types. 
According to CATH Q HDEA belongs to the orthog- 
onal bundle architecture, while cytochrome B562 repre- 
sents an up-down bundle. Using umbrella sampling com- 
bined with the weighted histogramming method, we are 
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P(q) constant temperature = 1.50 e 

solid -P (q) dashed -P|^(q) dotted-P ^^(q) 



F(Q) 256b extrapolated to the same temperature of 1 .3 

Solid-Origitial AMH Dotted-Nexl Generation AMH 




P(q) constant temperature = 1 .40 e 

solid - P (q) dashed -P (q( dotced-P (q) 




FIG. 13: Kolmogorov-Smirnov test shows the constant tem- 
perature simulation falhng out of equilibrium at a lower tem- 
perature of 1.4. The different probability distributions of 
structures between two independent simulations is no longer 
the same. 



able to sample parts of phase space that would rarely 
be encountered during a simulation |4^ . When using 
memories with a larger number of native contacts, we 
see improved free energy and energy profiles as shown in 
Fig. 1141 This is even more impressive when we consider 
this energy function has not yet been properly optimised 
for this new hamiltonian. For the other target, the re- 
sults are also not surprising. In this case the next gen- 
eration memories used to simulate this protein were not 
of greater structural quality than the initial set. Thus 
a very similar free energy profile was generated as seen 
in Fig. El Our use of q as an order parameter was suc- 
cessful in identifying the high Q protein for the 256B 
example. This is due to the highly funnelled character- 
istic of the first generation energy function. The original 
energy function for 1BG8 is not as funnelled so therefore 
there is poorer enrichment by scanning with little q. This 
limitation could be over come by increasing the amount 
of sampling of structures in the first generation simula- 
tions. More simulations would guarantee better structure 
as was demonstrated during the CASP5 exercise. This 
difference in the enrichment could be anticipated by us- 
ing the Kolmogorov-Smirnov measure to differentiate the 
distribution of the q values encountered between struc- 
tures derived from simulation and the protein databank. 




Local range in Sequence E(Q) 256b extrapolated T = 1.3 

Solid-Oriainal AMH Doited-Next Generation AMH 




Medium range in Sequence E(Q) 256b extrapolated T = 1.3 

Solid-Orisinal AMH Dotted-Next Generation AMH 
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Long range in Sequence E{Q) 256b extrapolated T = 1.3 

Solid-Original AMH Dotted-Next Generation AMH 
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FIG. 14: The free energy the two different energy functions for 
the protein 256B, shows roughly a 5-10 fcsT improvement for 
this protein. The primary improvements are in the medium 
and long range distance classes. 



12 



Total Q 

black = 0.05 dotted = memories 




F(Q) lbg8 extrapolated to the same temperature of 1.3 

Solid-Originiil AMH Dotted-Next Generaiion AMH 




diction schemes. It produces a scries of lessons for us 
and we hope for others. In the future, a more balanced 
efforts between the sampling and selection of structures 
from that ensemble would appear to be desirable. More 
efforts in selection would have clearly improved the re- 
sults submitted in CASP6. While it is was computation- 
ally impractical to quench all of the structures simulated 
during the prediction season, the comparison of the con- 
tact maps demonstrated further that tempering of the 
structure would have improved intermediate range or- 
dering. Using preliminary structures as input to a next 
generation of AMH modelling improves the quality of the 
prediction results. While these results may initially ap- 
pear to be model or energy function specific, we feel that 
any algorithm that uses structures as an input would 
benefit from similar next generation approaches. 
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FIG. 15: The free energy the two difTcrcnt energy functions 
for the protein 1BG8 show little improvement. The memories 
though show no enrichment in native contacts. 



Conclusion 

These case studies from our participation in the CASP 
experiment only provide a snap shot of our group's pre- 



The authors thank Joe Hegler, Zaida Luthey-Schulten, 
Garegin Papoian, and Marcio Von Muhlen for their key 
roles in developing codes used in this study and for 
many helpful discussions over the years. The efforts of 
P.G.W. are supported through the National Institutes 
of Health Grant 5R01GM44557. Computing resources 
were supplied by the Center for Theoretical Biologi- 
cal Physics through National Science Foundation Grants 
PHY0216576 and PHY0225630. 



[1] Moult, J.; Fidelis, K.; Zemla, A.; Hubbard, T. Proteins 

2003, 53 Suppl 6, 334-339. 
[2] Goldstein, R. A.; Luthey-Schulten, Z. A.; Wolynes, P. G. 

Proc Natl Acad Sai USA 1992, 89, 4918-4922. 
[3] Bryngelson, J. D.; Wolynes, P. G. Proc Natl Acad Sci 

USA 1987, 84, 7524-7528. 
[4] Anfinsen, C. B. Science 1973, 181, 223-230. 
[5] Go, N. Annu Rev Biophys and Bioeng 1983, 12, 183-210. 
[6] Koga, N.; Takada, S. J Mol Biol 2001, 313, 171-180. 
[7] Portman, J. J.; Takada, S.; Wolynes, P. G. Phys Rev 

Lett 1998, 81, 5237-5240. 
[8] Wales, D. Energy Landscapes; Cambridge University 

Press: Cambridge, UK, 2003. 
[9] Wheelan, S. J.; Marchler-Bauer, A.; Bryant, S. H. Bioin- 

formatics 2000, 16, 613-618. 
[10] George, R. A.; Heringa, J. J Mol Biol 2002, 316, 839- 

851. 

[11] Rigdon, D. ,J. Protein Eng 2002, 15, 65-77. 

[12] Hardin, C; Eastwood, M.; Prentiss, M.; Luthey- 
Schulten, Z.; Wolynes, P. G. Proc. Nat. Acad. Sci. U.S.A. 
2002, 100, 1679-1684. 



[13] Eastwood, M. P.; Hardin, C; Luthey-Schulten, Z.; 
Wolynes, P. G. IBM Systems Research 2001, 45, 475- 

497. 

[14] Korctke, K. K.; Luthcy-Schuhen, Z.; Wolynes, P. G. 

Protein Set 1996, 5, 1043-1059. 
[15] Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; 

Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; 

Bourne, P. E. Nucl. Acids Res. 2000, 28, 235-242. 
[16] Shindyalov, I.; Bourne, P. Protein Engineering 1998, 

11, 739-747. 

[17] Friedrichs, M. S.; Wolynes, P. G. Science 1989, 246, 
371-373. 

[18] Friedrichs, M.; Wolynes, P. G. Tet Comp Meth 1990, 3, 
175. 

[19] Friedrichs, M. S.; Goldstein, R. A.; Wolynes, P. G. J 

Mol Biol 1991, 222, 1013-1034. 
[20] Hardin, C; Eastwood, M.; Luthey-Schulten, Z.; 

Wolynes, P. G. Proc Natl Acad Set USA 2000, 97, 14235- 

14240. 

[21] Goldstein, R.; Luthey-Schulten, Z. A.; Wolynes, P. G. 
Proc Natl Acad Sci USA 1992, 89, 9029-9033. 



13 



[22] Hopfield, J. J. Proc Natl Acad Sci USA 1982, 79, 2554- 
2558. 

[23] Ryckacrt, J.; Ciccotti, G.; Berendsen., H. J Comput Phys 

1977, 23, 327-341. 
[24] Ramachandran, G.; Sasisekharan, V. Adv Protein Chem 

1968, 23, 283-438. 
[25] Papoian, G. A.; Ulander, J.; Eastwood, M. P.; Luthey- 

Sclmltcri, Z.; Wolyncs, P. G. Proc Natl Acad Sci USA 

2004, 101, 3352-3357. 
[26] Papoian, G. A.; Wolynes, P. G. Biopolymers 2003, 68, 

333-349. 

[27] Papoian, G. A.; Ulander, J.; Wolynes, P. G. J Am Chem 

Soc 2003, 125, 9170-9178. 
[28] Maxfield, F. R.; Scheraga, H. A. Biochemistry 1979, 18, 

697-704. 

[29] Finkelstein, A. V. Phys Rev Lett 1998, 80, 4823-4825. 
[30] Altschul, S.; Madden, T.; Schaffer, A.; Zhang, J.; 

Zhang, Z.; Miller, W.; Lipman, D. Nucl. Acids Res. 

1997, 25, 3389-3402. 
[31] Stajich, J. E. et al. Genome Res. 2002, 12, 1611-1618. 
[32] Thompson, J.; Higgins, D.; Gibson, T. Nucl. Acids Res. 

1994, 22, 4673-4680. 
[33] Zhang, Y.; Kolinski, A.; Skolnick, J. Biophys J 2003, 

85, 1145-1164. 

[34] Hardin, C; Eastwood, M.; Prentiss, M.; Luthey- 



Schultcn, Z.; Wolynes, P. G. J Comput Chem 2002, 
23, 138-146. 

[35] Bonneau, R.; Tsai, J.; Ruczinski, I.; Chivian, D.; 

Rohl, C; Strauss, C. E. M.; Baker, D. Proteins 2001, 

Suppl 5, 119-126. 
[36] Zhou, H.; Zhou, Y. Proteins 2004, 54, 315-322. 
[37] Kussell, E.; Shakhnovich, E. I. Phys Rev Lett 2002, 89, 

168101. 

[38] Simons, K.; Kooperberg, C; Huang, E.; Baker, D. J. 

Mol. Biol. 1997, 268, 209-225. 
[39] Betancourt, M.; Skolnick, J. J Mol Biol 2004, 2, 635- 

649. 

[40] Saven, J. G.; Wolynes, P. G. J. Mol. Biol. 1996, 257, 
199-216. 

[41] Allen, M. P.; Tildcsloy, D. J. Computer Simulation of 

Liquids; Clarendon Press: New York, NY, USA, 1987. 
[42] Eastwood, M.; Hardin, C.; Luthey-Schulten, Z.; 

Wolynes, P. G. J Chem Phys 2003, 118, 8500-8512. 
[43] Du, R.; Pande, V.; A.Y., G.; Shakhnovich, E. I. J Chem 

Phys 1997, 108, 334-350. 
[44] Koretke, K. K.; Luthey-Schuhen, Z.; Wolyncs, P. G. 

Proc Natl Acad Sci USA 1998, 95, 2932-2937. 
[45] Pearl, F. et al. Nucl. Acids Res. 2005, 33, D247-251. 
[46] Kong, X.; Brooks HI, C. L. J Chem Phys 1996, 105, 

2414-2423. 



